Abstract

We provide a new proof of the linear convergence of the alternating direction method of multipliers (ADMM) when one of the objective terms is strongly convex. Our proof is based on a framework for analyzing optimization algorithms introduced in Lessard et al. (2014), reducing algorithm convergence to verifying the stability of a dynamical system. This approach generalizes a number of existing results and obviates any assumptions about specific choices of algorithm parameters. On a numerical example, we demonstrate that minimizing the derived bound on the convergence rate provides a practical approach to selecting algorithm parameters for particular ADMM instances. We complement our upper bound by constructing a nearly-matching lower bound on the worst-case rate of convergence.

1. Introduction 

The alternating direction method of multipliers (ADMM) seeks to solve the problem

$$\begin{array}{ll} \text{minimize} & f(x) + g(z) \\ \text{subject to} & Ax + Bz = c, \end{array} \tag{1}$$

with variables $x \in \mathbb{R}^p$ and $z \in \mathbb{R}^q$ and constants $A \in \mathbb{R}^{r \times p}$, $B \in \mathbb{R}^{r \times q}$, and $c \in \mathbb{R}^r$. ADMM was introduced in Glowinski & Marroco (1975) and Gabay & Mercier (1976). More recently, it has found applications in a variety of distributed settings such as model fitting, resource allocation, and classification. A partial list of examples includes Bioucas-Dias & Figueiredo (2010); Wahlberg et al. (2012); Bird (2014); Forero et al. (2010); Sedghi et al. (2014); Li et al. (2014); Wang & Banerjee (2012); Zhang et al. (2012); Meshi & Globerson (2011); Wang et al. (2013); Aslan et al. (2013); Forouzan & Ihler (2013); Romera-Paredes & Pontil (2013); Behmardi et al. (2014); Zhang & Kwok (2014). See Boyd et al. (2011) for an overview.

Part of the appeal of ADMM is the fact that, in many contexts, the algorithm updates lend themselves to parallel implementations. The algorithm is given in Algorithm 1. We refer to $\rho > 0$ as the step-size parameter.


Algorithm 1 Alternating Direction Method of Multipliers
1: Input: functions $f$ and $g$, matrices $A$ and $B$, vector $c$, parameter $\rho$
2: Initialize $x_0$, $z_0$, $u_0$
3: repeat
4:   $x_{k+1} = \arg\min_x f(x) + \frac{\rho}{2}\|Ax + Bz_k - c + u_k\|^2$
5:   $z_{k+1} = \arg\min_z g(z) + \frac{\rho}{2}\|Ax_{k+1} + Bz - c + u_k\|^2$
6:   $u_{k+1} = u_k + Ax_{k+1} + Bz_{k+1} - c$
7: until stopping criterion is met


A popular variant of Algorithm 1 is over-relaxed ADMM, which introduces an additional parameter $\alpha$ and replaces each instance of $Ax_{k+1}$ in the $z$ and $u$ updates in Algorithm 1 with

$$\alpha A x_{k+1} - (1 - \alpha)(B z_k - c).$$

The parameter $\alpha$ is typically chosen to lie in the interval $(0, 2]$, but we demonstrate in Section 8 that a larger set of choices can lead to convergence. Over-relaxed ADMM is described in Algorithm 2. When $\alpha = 1$, Algorithm 2 and Algorithm 1 coincide. We will analyze Algorithm 2.

The conventional wisdom that ADMM works well without any tuning (Boyd et al., 2011), for instance by setting $\rho = 1$, is often not borne out in practice. Algorithm 1 can be challenging to tune, and Algorithm 2 is even harder. We use the machinery developed in this paper to make reasonable recommendations for setting $\rho$ and $\alpha$ when some information about $f$ is available (Section 8).

Algorithm 2 Over-Relaxed Alternating Direction Method of Multipliers
1: Input: functions $f$ and $g$, matrices $A$ and $B$, vector $c$, parameters $\rho$ and $\alpha$
2: Initialize $x_0$, $z_0$, $u_0$
3: repeat
4:   $x_{k+1} = \arg\min_x f(x) + \frac{\rho}{2}\|Ax + Bz_k - c + u_k\|^2$
5:   $z_{k+1} = \arg\min_z g(z) + \frac{\rho}{2}\|\alpha Ax_{k+1} - (1 - \alpha)Bz_k + Bz - \alpha c + u_k\|^2$
6:   $u_{k+1} = u_k + \alpha Ax_{k+1} - (1 - \alpha)Bz_k + Bz_{k+1} - \alpha c$
7: until stopping criterion is met
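To make the three updates of Algorithm 2 concrete, the following is a minimal sketch for one special case that is not taken from the paper: $f(x) = \frac{1}{2}x^\top Q x - q^\top x$, $g(z) = \lambda\|z\|_1$, $A = I$, $B = -I$, and $c = 0$. The function names `soft_threshold` and `over_relaxed_admm` and the fixed iteration count are illustrative assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def over_relaxed_admm(Q, q, lam, rho=1.0, alpha=1.5, iters=500):
    """Sketch of Algorithm 2 for f(x) = 0.5 x'Qx - q'x, g(z) = lam * ||z||_1,
    with the assumed constraint data A = I, B = -I, c = 0 (so x = z)."""
    d = q.shape[0]
    x, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    K = Q + rho * np.eye(d)  # system matrix of the quadratic x-update
    for _ in range(iters):
        # x-update: minimize f(x) + (rho/2) ||x - z_k + u_k||^2
        x = np.linalg.solve(K, q + rho * (z - u))
        # over-relaxation: alpha * A x_{k+1} - (1 - alpha) * (B z_k - c)
        x_hat = alpha * x + (1.0 - alpha) * z
        # z-update: minimize g(z) + (rho/2) ||x_hat - z + u_k||^2
        z_new = soft_threshold(x_hat + u, lam / rho)
        # u-update: running sum of the (relaxed) constraint residual
        u = u + x_hat - z_new
        z = z_new
    return x, z, u
```

Setting `alpha=1.0` recovers Algorithm 1 for this instance. A fixed iteration budget stands in for the stopping criterion, which the algorithm listing leaves unspecified.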


In this paper, we give an upper bound on the linear rate of convergence of Algorithm 2 for all $\rho$ and $\alpha$ (Theorem 7), and we give a nearly-matching lower bound (Theorem 8).

Importantly, we show that we can prove convergence rates for Algorithm 2 by numerically solving a $4 \times 4$ semidefinite program (Theorem 6). When we change the parameters of Algorithm 2, the semidefinite program changes. Whereas prior work requires a new proof of convergence for every change to the algorithm, our work automates that process.

Our work builds on the integral quadratic constraint framework introduced in Lessard et al. (2014), which uses ideas from robust control to analyze optimization algorithms that can be cast as discrete-time linear dynamical systems. Related ideas, in the context of feedback control, appear in Corless (1990); D'Alto & Corless (2013). Our work provides a flexible framework for analyzing variants of Algorithm 1, including those like Algorithm 2 created by the introduction of additional parameters. In Section 7, we compare our results to prior work.

2. Preliminaries and Notation 

Let $\bar{\mathbb{R}}$ denote the extended real numbers $\mathbb{R} \cup \{+\infty\}$. Suppose that $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is convex and differentiable, and let $\nabla f$ denote the gradient of $f$. We say that $f$ is strongly convex with parameter $m > 0$ if for all $x, y \in \mathbb{R}^d$, we have

$$f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \tfrac{m}{2}\|x - y\|^2.$$

When $\nabla f$ is Lipschitz continuous with parameter $L$, then

$$f(x) \le f(y) + \nabla f(y)^\top (x - y) + \tfrac{L}{2}\|x - y\|^2.$$

For $0 < m < L < \infty$, let $S_d(m, L)$ denote the set of differentiable convex functions $f : \mathbb{R}^d \to \mathbb{R}$ that are strongly convex with parameter $m$ and whose gradients are Lipschitz continuous with parameter $L$. We let $S_d(0, \infty)$ denote the set of convex functions $\mathbb{R}^d \to \bar{\mathbb{R}}$. In general, we let $\partial f$ denote the subdifferential of $f$. We denote the $d$-dimensional identity matrix by $I_d$ and the $d$-dimensional zero matrix by $0_d$. We will use the following results.


Lemma 1. Suppose that $f \in S_d(m, L)$, where $0 < m < L < \infty$. Suppose that $b_1 = \nabla f(a_1)$ and $b_2 = \nabla f(a_2)$. Then

$$\begin{bmatrix} a_1 - a_2 \\ b_1 - b_2 \end{bmatrix}^\top \begin{bmatrix} -2mL\, I_d & (m + L) I_d \\ (m + L) I_d & -2 I_d \end{bmatrix} \begin{bmatrix} a_1 - a_2 \\ b_1 - b_2 \end{bmatrix} \ge 0.$$

Proof. The Lipschitz continuity of $\nabla f$ implies the co-coercivity of $\nabla f$, that is

$$(a_1 - a_2)^\top (b_1 - b_2) \ge \tfrac{1}{L}\|b_1 - b_2\|^2.$$

Note that $f(x) - \frac{m}{2}\|x\|^2$ is convex and its gradient is Lipschitz continuous with parameter $L - m$. Applying the co-coercivity condition to this function and rearranging gives

$$(m + L)(a_1 - a_2)^\top (b_1 - b_2) \ge mL\|a_1 - a_2\|^2 + \|b_1 - b_2\|^2,$$

which can be put in matrix form to complete the proof. □
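As a numerical sanity check of Lemma 1, the sketch below evaluates the quadratic form on a quadratic $f(x) = \frac{1}{2}x^\top Q x$ whose Hessian has eigenvalues in $[m, L]$, so that $\nabla f(a) = Qa$. The test function, the dimension, and the helper names are assumptions chosen for illustration; the check is not a proof.

```python
import numpy as np


def lemma1_form(a1, a2, b1, b2, m, L):
    """Quadratic form of Lemma 1, expanded:
    -2mL ||da||^2 + 2 (m + L) da'db - 2 ||db||^2."""
    da, db = a1 - a2, b1 - b2
    return -2.0 * m * L * (da @ da) + 2.0 * (m + L) * (da @ db) - 2.0 * (db @ db)


def min_form_on_random_quadratic(d=5, m=0.5, L=10.0, trials=1000, seed=0):
    """Smallest value of the Lemma 1 form over random point pairs for
    f(x) = 0.5 x'Qx, where Q has eigenvalues spread over [m, L]."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal basis
    Q = U @ np.diag(np.linspace(m, L, d)) @ U.T
    worst = np.inf
    for _ in range(trials):
        a1, a2 = rng.standard_normal(d), rng.standard_normal(d)
        worst = min(worst, lemma1_form(a1, a2, Q @ a1, Q @ a2, m, L))
    return worst
```

For this quadratic the form equals $2\,(a_1 - a_2)^\top (Q - mI)(LI - Q)(a_1 - a_2)$, which is nonnegative because both factors are positive semidefinite and commute, so the returned minimum should be nonnegative up to rounding.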

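Lemma 2. Suppose that $f \in S_d(0, \infty)$, and suppose that $b_1 \in \partial f(a_1)$ and $b_2 \in \partial f(a_2)$. Then

$$\begin{bmatrix} a_1 - a_2 \\ b_1 - b_2 \end{bmatrix}^\top \begin{bmatrix} 0_d & I_d \\ I_d & 0_d \end{bmatrix} \begin{bmatrix} a_1 - a_2 \\ b_1 - b_2 \end{bmatrix} \ge 0.$$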

Lemma 2 is simply the statement that the subdifferential of 
a convex function is a monotone operator. 

When $M$ is a matrix, we use $\kappa_M$ to denote the condition number of $M$. For example, $\kappa_A = \sigma_1(A)/\sigma_p(A)$, where $\sigma_1(A)$ and $\sigma_p(A)$ denote the largest and smallest singular values of the matrix $A$. When $f \in S_d(m, L)$, we let $\kappa_f = L/m$ denote the condition number of $f$. We denote the Kronecker product of matrices $M$ and $N$ by $M \otimes N$.

3. ADMM as a Dynamical System 

We group our assumptions together in Assumption 3. 

Assumption 3. We assume that $f$ and $g$ are convex, closed, and proper. We assume that for some $0 < m < L < \infty$, we have $f \in S_p(m, L)$ and $g \in S_q(0, \infty)$. We assume that $A$ is invertible and that $B$ has full column rank.

The assumption that $f$ and $g$ are closed (their sublevel sets are closed) and proper (they neither take on the value $-\infty$ nor are they uniformly equal to $+\infty$) is standard.

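We begin by casting over-relaxed ADMM as a discrete-time dynamical system with state sequence $(\xi_k)$, input sequence $(\nu_k)$, and output sequences $(w_k^1)$ and $(w_k^2)$ satisfying the recursions

$$\xi_{k+1} = (\hat{A} \otimes I_r)\,\xi_k + (\hat{B} \otimes I_r)\,\nu_k \tag{2a}$$

$$w_k^1 = (\hat{C}^1 \otimes I_r)\,\xi_k + (\hat{D}^1 \otimes I_r)\,\nu_k \tag{2b}$$

$$w_k^2 = (\hat{C}^2 \otimes I_r)\,\xi_k + (\hat{D}^2 \otimes I_r)\,\nu_k \tag{2c}$$

for particular matrices $\hat{A}$, $\hat{B}$, $\hat{C}^1$, $\hat{D}^1$, $\hat{C}^2$, and $\hat{D}^2$ (whose dimensions do not depend on any problem parameters).

First define the functions $\hat{f}, \hat{g} : \mathbb{R}^r \to \bar{\mathbb{R}}$ via

$$\hat{f} = (\rho^{-1} f) \circ A^{-1}, \qquad \hat{g} = (\rho^{-1} g) \circ B^{\dagger} + \mathbb{I}_{\operatorname{im} B}, \tag{3}$$

where $B^{\dagger}$ is any left inverse of $B$ and where $\mathbb{I}_{\operatorname{im} B}$ is the $\{0, \infty\}$-indicator function of the image of $B$. We define $\kappa = \kappa_f \kappa_A^2$, and to normalize we define

$$\hat{m} = \frac{m}{\sigma_1^2(A)}, \qquad \hat{L} = \frac{L}{\sigma_p^2(A)}, \qquad \rho = (\hat{m}\hat{L})^{1/2} \rho_0. \tag{4}$$

Note that under Assumption 3,

$$\hat{f} \in S_p\big(\rho_0^{-1}\kappa^{-1/2},\ \rho_0^{-1}\kappa^{1/2}\big) \tag{5a}$$

$$\hat{g} \in S_r(0, \infty). \tag{5b}$$

To define the relevant sequences, let the sequences $(x_k)$, $(z_k)$, and $(u_k)$ be generated by Algorithm 2 with parameters $\alpha$ and $\rho$. Define the sequences $(r_k)$ and $(s_k)$ by $r_k = Ax_k$ and $s_k = Bz_k$ and the sequence $(\xi_k)$ by

$$\xi_k = \begin{bmatrix} s_k \\ u_k \end{bmatrix}.$$

We define the sequence $(\nu_k)$ as in Proposition 4.

Proposition 4. There exist sequences $(\beta_k)$ and $(\gamma_k)$ with $\beta_k = \nabla \hat{f}(r_k)$ and $\gamma_k \in \partial \hat{g}(s_k)$ such that when we define the sequence $(\nu_k)$ by

$$\nu_k = \begin{bmatrix} \beta_{k+1} \\ \gamma_{k+1} \end{bmatrix},$$

then $(\xi_k)$ and $(\nu_k)$ satisfy (2a) with the matrices

$$\hat{A} = \begin{bmatrix} 1 & \alpha - 1 \\ 0 & 0 \end{bmatrix}, \qquad \hat{B} = \begin{bmatrix} \alpha & -1 \\ 0 & -1 \end{bmatrix}. \tag{6}$$

Proof. Using the fact that $A$ has full rank, we rewrite the update rule for $x$ from Algorithm 2 as

$$x_{k+1} = A^{-1} \arg\min_r\ f(A^{-1} r) + \tfrac{\rho}{2}\|r + s_k - c + u_k\|^2.$$

Multiplying through by $A$, we can write

$$r_{k+1} = \arg\min_r\ \hat{f}(r) + \tfrac{1}{2}\|r + s_k - c + u_k\|^2.$$

This implies that

$$0 = \nabla \hat{f}(r_{k+1}) + r_{k+1} + s_k - c + u_k,$$

and so

$$r_{k+1} = -s_k - u_k + c - \beta_{k+1}, \tag{7}$$

where $\beta_{k+1} = \nabla \hat{f}(r_{k+1})$. In the same spirit, we rewrite the update rule for $z$ as

$$s_{k+1} = \arg\min_s\ \hat{g}(s) + \tfrac{1}{2}\|\alpha r_{k+1} - (1 - \alpha)s_k + s - \alpha c + u_k\|^2.$$

It follows that there exists some $\gamma_{k+1} \in \partial \hat{g}(s_{k+1})$ such that

$$0 = \gamma_{k+1} + \alpha r_{k+1} - (1 - \alpha)s_k + s_{k+1} - \alpha c + u_k.$$

It follows then that

$$\begin{aligned} s_{k+1} &= -\alpha r_{k+1} + (1 - \alpha)s_k + \alpha c - u_k - \gamma_{k+1} \\ &= s_k - (1 - \alpha)u_k + \alpha\beta_{k+1} - \gamma_{k+1}, \end{aligned} \tag{8}$$

where the second equality follows by substituting in (7). Combining (7) and (8) to simplify the $u$ update, we have

$$\begin{aligned} u_{k+1} &= u_k + \alpha r_{k+1} - (1 - \alpha)s_k + s_{k+1} - \alpha c \\ &= -\gamma_{k+1}. \end{aligned} \tag{9}$$

Together, (8) and (9) confirm the relation in (2a). □

Corollary 5. Define the sequences $(\beta_k)$ and $(\gamma_k)$ as in Proposition 4. Define the sequences $(w_k^1)$ and $(w_k^2)$ via

$$w_k^1 = \begin{bmatrix} r_{k+1} - c \\ \beta_{k+1} \end{bmatrix}, \qquad w_k^2 = \begin{bmatrix} s_{k+1} \\ \gamma_{k+1} \end{bmatrix}.$$

Then the sequences $(\xi_k)$, $(\nu_k)$, $(w_k^1)$, and $(w_k^2)$ satisfy (2b) and (2c) with the matrices

$$\hat{C}^1 = \begin{bmatrix} -1 & -1 \\ 0 & 0 \end{bmatrix}, \quad \hat{D}^1 = \begin{bmatrix} -1 & 0 \\ 1 & 0 \end{bmatrix}, \quad \hat{C}^2 = \begin{bmatrix} 1 & \alpha - 1 \\ 0 & 0 \end{bmatrix}, \quad \hat{D}^2 = \begin{bmatrix} \alpha & -1 \\ 0 & 1 \end{bmatrix}. \tag{10}$$

4. Convergence Rates from Semidefinite Programming

Now, in Theorem 6, we make use of the perspective developed in Section 3 to obtain convergence rates for Algorithm 2. This is essentially the same as the main result of Lessard et al. (2014), and we include it because it is simple and self-contained.

Theorem 6. Suppose that Assumption 3 holds. Let the sequences $(x_k)$, $(z_k)$, and $(u_k)$ be generated by running Algorithm 2 with step size $\rho = (\hat{m}\hat{L})^{1/2}\rho_0$ and with over-relaxation parameter $\alpha$. Suppose that $(x_*, z_*, u_*)$ is a fixed point of Algorithm 2, and define

$$\varphi_k = \begin{bmatrix} z_k \\ u_k \end{bmatrix}, \qquad \varphi_* = \begin{bmatrix} z_* \\ u_* \end{bmatrix}.$$

Fix $0 < \tau < 1$, and suppose that there exist a $2 \times 2$ positive definite matrix $P \succ 0$ and nonnegative constants $\lambda^1, \lambda^2 \ge 0$ such that the $4 \times 4$ linear matrix inequality

$$0 \succeq \begin{bmatrix} \hat{A}^\top P \hat{A} - \tau^2 P & \hat{A}^\top P \hat{B} \\ \hat{B}^\top P \hat{A} & \hat{B}^\top P \hat{B} \end{bmatrix} + \begin{bmatrix} \hat{C}^1 & \hat{D}^1 \\ \hat{C}^2 & \hat{D}^2 \end{bmatrix}^\top \begin{bmatrix} \lambda^1 M^1 & 0 \\ 0 & \lambda^2 M^2 \end{bmatrix} \begin{bmatrix} \hat{C}^1 & \hat{D}^1 \\ \hat{C}^2 & \hat{D}^2 \end{bmatrix} \tag{11}$$

is satisfied, where $\hat{A}$ and $\hat{B}$ are defined in (6), where $\hat{C}^1$, $\hat{D}^1$, $\hat{C}^2$, and $\hat{D}^2$ are defined in (10), and where $M^1$ and $M^2$ are given by

$$M^1 = \begin{bmatrix} -2\rho_0^{-2} & \rho_0^{-1}(\kappa^{-1/2} + \kappa^{1/2}) \\ \rho_0^{-1}(\kappa^{-1/2} + \kappa^{1/2}) & -2 \end{bmatrix}, \qquad M^2 = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$

Then for all $k \ge 0$, we have

$$\|\varphi_k - \varphi_*\| \le \kappa_B \sqrt{\kappa_P}\, \|\varphi_0 - \varphi_*\|\, \tau^k.$$

Proof. Define $r_k$, $s_k$, $\beta_k$, $\gamma_k$, $\xi_k$, $\nu_k$, $w_k^1$, and $w_k^2$ as before. Choose $r_* = Ax_*$, $s_* = Bz_*$, and

$$w_*^1 = \begin{bmatrix} r_* - c \\ \beta_* \end{bmatrix}, \quad w_*^2 = \begin{bmatrix} s_* \\ \gamma_* \end{bmatrix}, \quad \xi_* = \begin{bmatrix} s_* \\ u_* \end{bmatrix}, \quad \nu_* = \begin{bmatrix} \beta_* \\ \gamma_* \end{bmatrix}$$

such that $(\xi_*, \nu_*, w_*^1, w_*^2)$ is a fixed point of the dynamics of (2) and satisfying $\beta_* = \nabla \hat{f}(r_*)$, $\gamma_* \in \partial \hat{g}(s_*)$. Now, consider the Kronecker product of the right hand side of (11) and $I_r$. Multiplying this on the left and on the right by $\begin{bmatrix} (\xi_j - \xi_*)^\top & (\nu_j - \nu_*)^\top \end{bmatrix}$ and its transpose, respectively, we find

$$\begin{aligned} 0 \ge{} & (\xi_{j+1} - \xi_*)^\top P (\xi_{j+1} - \xi_*) - \tau^2 (\xi_j - \xi_*)^\top P (\xi_j - \xi_*) \\ & + \lambda^1 (w_j^1 - w_*^1)^\top M^1 (w_j^1 - w_*^1) \\ & + \lambda^2 (w_j^2 - w_*^2)^\top M^2 (w_j^2 - w_*^2). \end{aligned} \tag{12}$$

Lemma 1 and (5a) show that the third term on the right hand side of (12) is nonnegative. Lemma 2 and (5b) show that the fourth term on the right hand side of (12) is nonnegative. It follows that

$$(\xi_{j+1} - \xi_*)^\top P (\xi_{j+1} - \xi_*) \le \tau^2 (\xi_j - \xi_*)^\top P (\xi_j - \xi_*).$$

Inducting from $j = 0$ to $k - 1$, we see that

$$(\xi_k - \xi_*)^\top P (\xi_k - \xi_*) \le \tau^{2k} (\xi_0 - \xi_*)^\top P (\xi_0 - \xi_*),$$

for all $k$. It follows that

$$\|\xi_k - \xi_*\| \le \sqrt{\kappa_P}\, \|\xi_0 - \xi_*\|\, \tau^k.$$

The conclusion follows. □

For fixed values of $\alpha$, $\rho_0$, $\hat{m}$, $\hat{L}$, and $\tau$, the feasibility of (11) is a semidefinite program with variables $P$, $\lambda^1$, and $\lambda^2$. We perform a binary search over $\tau$ to find the minimal rate $\tau$ such that the linear matrix inequality in (11) is satisfied. The results are shown in Figure 1 for a wide range of condition numbers $\kappa$, for $\alpha = 1.5$, and for several choices of $\rho_0$. In Figure 2, we plot the values $-1/\log \tau$ to show the number of iterations required to achieve a desired accuracy.

Figure 1. For $\alpha = 1.5$ and for several choices of $\epsilon$ in $\rho_0 = \kappa^\epsilon$, we plot the minimal rate $\tau$ for which the linear matrix inequality in (11) is satisfied as a function of $\kappa$.

Figure 2. For $\alpha = 1.5$ and for several choices of $\epsilon$ in $\rho_0 = \kappa^\epsilon$, we compute the minimal rate $\tau$ such that the linear matrix inequality in (11) is satisfied, and we plot $-1/\log \tau$ as a function of $\kappa$.

Note that when $\rho_0 = \kappa^\epsilon$, the matrix $M^1$ is given by

$$M^1 = \begin{bmatrix} -2\kappa^{-2\epsilon} & \kappa^{-\frac{1}{2} - \epsilon} + \kappa^{\frac{1}{2} - \epsilon} \\ \kappa^{-\frac{1}{2} - \epsilon} + \kappa^{\frac{1}{2} - \epsilon} & -2 \end{bmatrix},$$

and so the linear matrix inequality in (11) depends only on $\kappa$ and not on $\hat{m}$ and $\hat{L}$. Therefore, we will consider step sizes of this form (recall from (4) that $\rho = (\hat{m}\hat{L})^{1/2}\rho_0$). The choice $\epsilon = 0$ is common in the literature (Giselsson & Boyd, 2014), but requires the user to know the strong-convexity parameter $\hat{m}$. We also consider the choice $\epsilon = 0.5$, which produces worse guarantees, but does not require knowledge of $\hat{m}$.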
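The binary search over the rate in Section 4 can be sketched independently of any particular semidefinite programming solver. In the sketch below, `is_feasible` is a hypothetical callable (not part of the paper) that reports whether the linear matrix inequality (11) admits a certificate at a given rate; in practice it would wrap an SDP solver. The search relies on feasibility being monotone in the rate, which holds because increasing the rate only makes the term $-\tau^2 P$ more negative.

```python
from typing import Callable, Optional


def minimal_rate(is_feasible: Callable[[float], bool],
                 tol: float = 1e-6) -> Optional[float]:
    """Bisection for the smallest rate in (0, 1] accepted by `is_feasible`.

    `is_feasible(tau)` is an assumed oracle: True when the LMI has a
    certificate at rate tau. Returns None when even tau = 1 is rejected,
    i.e. no rate is certified.
    """
    lo, hi = 0.0, 1.0
    if not is_feasible(hi):
        return None
    # Invariant: hi is feasible; lo is infeasible (or the trivial bound 0).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

A returned value strictly below one is an upper bound on the convergence rate to within `tol`; a value equal to one certifies no linear rate, matching the requirement $\tau < 1$ in Theorem 6.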

One weakness of Theorem 6 is the fact that the rate we produce is not given as a function of $\kappa$. To use Theorem 6 as stated, we first specify the condition number (for example, $\kappa = 1000$). Then we search for the minimal $\tau$ such that (11) is feasible. This produces an upper bound on the convergence rate of Algorithm 2 (for example, $\tau = 0.9$). To remedy this problem, in Section 5, we demonstrate how Theorem 6 can be used to obtain the convergence rate of Algorithm 2 as a symbolic function of the step size $\rho$ and the over-relaxation parameter $\alpha$.

5. Symbolic Rates for Various p and a 

In Section 4, we demonstrated how to use semidefinite programming to produce numerical convergence rates. That is, given a choice of algorithm parameters and the condition number $\kappa$, we could determine the convergence rate of Algorithm 2. In this section, we show how Theorem 6 can be used to prove symbolic convergence rates. That is, we describe the convergence rate of Algorithm 2 as a function of $\rho$, $\alpha$, and $\kappa$. In Theorem 7, we prove the linear convergence of Algorithm 2 for all choices $\alpha \in (0, 2)$ and $\rho = (\hat{m}\hat{L})^{1/2}\kappa^\epsilon$, with $\epsilon \in (-\infty, \infty)$. This result generalizes a number of results in the literature. As two examples, Giselsson & Boyd (2014) consider the case $\epsilon = 0$ and Deng & Yin (2012) consider the case $\alpha = 1$ and $\epsilon = 0.5$.

The rate given in Theorem 7 is loose by a factor of four relative to the lower bound given in Theorem 8. However, weakening the rate by a constant factor eases the proof by making it easier to find a certificate for use in (11).

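Theorem 7. Suppose that Assumption 3 holds. Let the sequences $(x_k)$, $(z_k)$, and $(u_k)$ be generated by running Algorithm 2 with parameter $\alpha \in (0, 2)$ and with step size $\rho = (\hat{m}\hat{L})^{1/2}\kappa^\epsilon$, where $\epsilon \in (-\infty, \infty)$. Define $x_*$, $z_*$, $u_*$, $\varphi_k$, and $\varphi_*$ as in Theorem 6. Then for all sufficiently large $\kappa$, we have

$$\|\varphi_k - \varphi_*\| \le C\, \|\varphi_0 - \varphi_*\| \left(1 - \frac{\alpha}{2\kappa^{0.5 + |\epsilon|}}\right)^k,$$

where

$$C = \kappa_B \sqrt{\frac{\max\{\alpha,\, 2 - \alpha\}}{\min\{\alpha,\, 2 - \alpha\}}}.$$

Proof. We claim that for all sufficiently large $\kappa$, the linear matrix inequality in (11) is satisfied with the rate $\tau = 1 - \frac{\alpha}{2\kappa^{0.5 + |\epsilon|}}$ and with certificate

$$\lambda^1 = \alpha\kappa^{\epsilon - 0.5}, \qquad \lambda^2 = \alpha, \qquad P = \begin{bmatrix} 1 & \alpha - 1 \\ \alpha - 1 & 1 \end{bmatrix}.$$

The matrix on the right hand side of (11) can be expressed as $-\frac{1}{4}\alpha\kappa^{-2} M$, where $M$ is a symmetric $4 \times 4$ matrix whose last row and column consist of zeros. We wish to prove that $M$ is positive semidefinite for all sufficiently large $\kappa$. To do so, we consider the cases $\epsilon \ge 0$ and $\epsilon < 0$ separately, though the two cases will be nearly identical. First suppose that $\epsilon \ge 0$. In this case, the nonzero entries of $M$ are specified by

$$\begin{aligned} M_{11} &= \alpha\kappa^{1 - 2\epsilon} + 4\kappa^{\frac{3}{2} - \epsilon} \\ M_{12} &= \alpha^2\kappa^{1 - 2\epsilon} - \alpha\kappa^{1 - 2\epsilon} + 12\kappa^{\frac{3}{2} - \epsilon} - 4\alpha\kappa^{\frac{3}{2} - \epsilon} \\ M_{13} &= 4\kappa + 8\kappa^{\frac{3}{2} - \epsilon} \\ M_{22} &= 8\kappa^2 - 4\alpha\kappa^2 + \alpha\kappa^{1 - 2\epsilon} + 4\kappa^{\frac{3}{2} - \epsilon} \\ M_{23} &= 4\kappa + 8\kappa^2 - 4\alpha\kappa^2 + 8\kappa^{\frac{3}{2} - \epsilon} \\ M_{33} &= 8\kappa + 8\kappa^2 - 4\alpha\kappa^2 + 8\kappa^{\frac{3}{2} - \epsilon} + 8\kappa^{\frac{3}{2} + \epsilon}. \end{aligned}$$

We show that each of the first three leading principal minors of $M$ is positive for sufficiently large $\kappa$. To understand the behavior of the leading principal minors, it suffices to look at their leading terms. For large $\kappa$, the first leading principal minor (which is simply $M_{11}$) is dominated by the term $4\kappa^{\frac{3}{2} - \epsilon}$, which is positive. Similarly, the second leading principal minor is dominated by the term $16(2 - \alpha)\kappa^{\frac{7}{2} - \epsilon}$, which is positive. When $\epsilon > 0$, the third leading principal minor is dominated by the term $128(2 - \alpha)\kappa^5$, which is positive. When $\epsilon = 0$, the third leading principal minor is dominated by the term $64\alpha(2 - \alpha)^2\kappa^5$, which is positive. Since these leading coefficients are all positive, it follows that for all sufficiently large $\kappa$, the matrix $M$ is positive semidefinite.

Now suppose that $\epsilon < 0$. In this case, the nonzero entries of $M$ are specified by

$$\begin{aligned} M_{11} &= 8\kappa^{\frac{3}{2} - \epsilon} - 4\kappa^{\frac{3}{2} + \epsilon} + \alpha\kappa^{1 + 2\epsilon} \\ M_{12} &= 8\kappa^{\frac{3}{2} - \epsilon} + 4\kappa^{\frac{3}{2} + \epsilon} - 4\alpha\kappa^{\frac{3}{2} + \epsilon} - \alpha\kappa^{1 + 2\epsilon} + \alpha^2\kappa^{1 + 2\epsilon} \\ M_{13} &= 4\kappa + 8\kappa^{\frac{3}{2} - \epsilon} \\ M_{22} &= 8\kappa^2 - 4\alpha\kappa^2 + 8\kappa^{\frac{3}{2} - \epsilon} - 4\kappa^{\frac{3}{2} + \epsilon} + \alpha\kappa^{1 + 2\epsilon} \\ M_{23} &= 4\kappa + 8\kappa^2 - 4\alpha\kappa^2 + 8\kappa^{\frac{3}{2} - \epsilon} \\ M_{33} &= 8\kappa + 8\kappa^2 - 4\alpha\kappa^2 + 8\kappa^{\frac{3}{2} - \epsilon} + 8\kappa^{\frac{3}{2} + \epsilon}. \end{aligned}$$

As before, we show that each of the first three leading principal minors of $M$ is positive. For large $\kappa$, the first leading principal minor (which is simply $M_{11}$) is dominated by the term $8\kappa^{\frac{3}{2} - \epsilon}$, which is positive. Similarly, the second leading principal minor is dominated by the term $32(2 - \alpha)\kappa^{\frac{7}{2} - \epsilon}$, which is positive. The third leading principal minor is dominated by the term $128(2 - \alpha)\kappa^5$, which is positive. Since these leading coefficients are all positive, it follows that for all sufficiently large $\kappa$, the matrix $M$ is positive semidefinite.

The result now follows from Theorem 6 by noting that $P$ has eigenvalues $\alpha$ and $2 - \alpha$. □

Note that since the matrix $P$ does not depend on $\rho$, the proof holds even when the step size changes at each iteration.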
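The certificate used in the proof of Theorem 7 can be checked numerically for a given parameter setting. The sketch below assembles the right hand side of (11) from the matrices in (6) and (10) and reports its largest eigenvalue; the helper names `lmi_matrix` and `theorem7_certificate` are illustrative, and a numerical check at one value of the condition number is only a sanity test, not a substitute for the proof.

```python
import numpy as np


def lmi_matrix(alpha, kappa, eps, P, lam1, lam2, tau):
    """Right hand side of the 4x4 linear matrix inequality (11),
    with rho_0 = kappa**eps as in Section 5."""
    rho0 = kappa ** eps
    A = np.array([[1.0, alpha - 1.0], [0.0, 0.0]])
    B = np.array([[alpha, -1.0], [0.0, -1.0]])
    C1 = np.array([[-1.0, -1.0], [0.0, 0.0]])
    D1 = np.array([[-1.0, 0.0], [1.0, 0.0]])
    C2 = np.array([[1.0, alpha - 1.0], [0.0, 0.0]])
    D2 = np.array([[alpha, -1.0], [0.0, 1.0]])
    off = (kappa ** -0.5 + kappa ** 0.5) / rho0
    M1 = np.array([[-2.0 / rho0 ** 2, off], [off, -2.0]])
    M2 = np.array([[0.0, 1.0], [1.0, 0.0]])
    AB = np.hstack([A, B])            # [A B], so AB' P AB gives all four blocks
    lyap = AB.T @ P @ AB
    lyap[:2, :2] -= tau ** 2 * P      # subtract tau^2 P from the top-left block
    CD1, CD2 = np.hstack([C1, D1]), np.hstack([C2, D2])
    return lyap + lam1 * CD1.T @ M1 @ CD1 + lam2 * CD2.T @ M2 @ CD2


def theorem7_certificate(alpha, kappa, eps):
    """Rate and certificate (P, lambda^1, lambda^2) from the proof of Theorem 7."""
    P = np.array([[1.0, alpha - 1.0], [alpha - 1.0, 1.0]])
    lam1 = alpha * kappa ** (eps - 0.5)
    lam2 = alpha
    tau = 1.0 - alpha / (2.0 * kappa ** (0.5 + abs(eps)))
    return P, lam1, lam2, tau


def max_eigenvalue(alpha, kappa, eps):
    """Largest eigenvalue of the LMI matrix under the Theorem 7 certificate;
    the LMI holds when this is nonpositive (up to rounding)."""
    P, lam1, lam2, tau = theorem7_certificate(alpha, kappa, eps)
    M = lmi_matrix(alpha, kappa, eps, P, lam1, lam2, tau)
    return float(np.linalg.eigvalsh(0.5 * (M + M.T)).max())
```

With this certificate the last row and column of the assembled matrix vanish, so the largest eigenvalue is zero up to rounding whenever the inequality holds.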


6. Lower Bounds 

In this section, we probe the tightness of the upper bounds on the convergence rate of Algorithm 2 given by Theorem 6. The construction of the lower bound in this section is similar to a construction given in Ghadimi et al. (2015).

Let $Q$ be a $d$-dimensional symmetric positive-definite matrix whose largest and smallest eigenvalues are $L$ and $m$ respectively. Let $f(x) = \frac{1}{2}x^\top Q x$ be a quadratic and let $g(z) = \frac{\delta}{2}\|z\|^2$ for some $\delta \ge 0$. Let $A = I_d$, $B = -I_d$, and $c = 0$. With these definitions, the optimization problem in (1) is solved by $x = z = 0$. The updates for Algorithm 2 are given by

$$x_{k+1} = \rho(Q + \rho I)^{-1}(z_k - u_k) \tag{13a}$$

$$z_{k+1} = \frac{\rho}{\delta + \rho}\big(\alpha x_{k+1} + (1 - \alpha)z_k + u_k\big) \tag{13b}$$

$$u_{k+1} = u_k + \alpha x_{k+1} + (1 - \alpha)z_k - z_{k+1}. \tag{13c}$$

Solving for $z_k$ in (13b) and substituting the result into (13c) gives $u_{k+1} = \frac{\delta}{\rho} z_{k+1}$. Then eliminating $x_{k+1}$ and $u_k$ from (13b) using (13a) and the fact that $u_k = \frac{\delta}{\rho} z_k$ allows us to express the update rule purely in terms of $z$ as

$$z_{k+1} = T z_k, \qquad T = \frac{\alpha\rho(\rho - \delta)}{\rho + \delta}(Q + \rho I)^{-1} + \frac{\rho - \alpha\rho + \delta}{\rho + \delta} I.$$

Note that the eigenvalues of $T$ are given by

$$1 - \frac{\alpha\rho(\lambda + \delta)}{(\rho + \delta)(\lambda + \rho)}, \tag{14}$$

where $\lambda$ is an eigenvalue of $Q$. We will use this setup to construct a lower bound on the worst-case convergence rate of Algorithm 2 in Theorem 8.
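The reduction of the updates (13a) to (13c) to a single linear map in $z$, and the eigenvalue formula (14), can be checked numerically. The sketch below is illustrative only; the function names and the random test matrix are assumptions.

```python
import numpy as np


def admm_step(Q, z, u, rho, alpha, delta):
    """One pass of the updates (13a)-(13c) for f(x) = 0.5 x'Qx and
    g(z) = (delta/2) ||z||^2 with A = I, B = -I, c = 0."""
    d = Q.shape[0]
    x_new = rho * np.linalg.solve(Q + rho * np.eye(d), z - u)               # (13a)
    z_new = rho / (delta + rho) * (alpha * x_new + (1.0 - alpha) * z + u)   # (13b)
    u_new = u + alpha * x_new + (1.0 - alpha) * z - z_new                   # (13c)
    return x_new, z_new, u_new


def z_update_matrix(Q, rho, alpha, delta):
    """Matrix T with z_{k+1} = T z_k, valid once u_k = (delta/rho) z_k."""
    d = Q.shape[0]
    inv = np.linalg.inv(Q + rho * np.eye(d))
    return (alpha * rho * (rho - delta) / (rho + delta)) * inv \
        + ((rho - alpha * rho + delta) / (rho + delta)) * np.eye(d)


def eig_formula(lams, rho, alpha, delta):
    """Eigenvalues of T predicted by (14) from the eigenvalues of Q."""
    lams = np.asarray(lams, dtype=float)
    return 1.0 - alpha * rho * (lams + delta) / ((rho + delta) * (lams + rho))
```

Comparing `np.linalg.eigvalsh(z_update_matrix(...))` with `eig_formula(np.linalg.eigvalsh(Q), ...)` after sorting both gives a direct check of (14) for any symmetric positive-definite `Q`.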

Theorem 8. Suppose that Assumption 3 holds. The worst-case convergence rate of Algorithm 2, when run with step size $\rho = (\hat{m}\hat{L})^{1/2}\kappa^\epsilon$ and over-relaxation parameter $\alpha$, is lower-bounded by

$$1 - \frac{2\alpha}{1 + \kappa^{0.5 + |\epsilon|}}. \tag{15}$$

Proof. First consider the case $\epsilon \ge 0$. Choosing $\delta = 0$ and $\lambda = m$, from (14), we see that $T$ has eigenvalue

$$1 - \frac{\alpha}{1 + \kappa^{0.5 + \epsilon}}. \tag{16}$$

When initialized with $z$ as the eigenvector corresponding to this eigenvalue, Algorithm 2 will converge linearly with rate given exactly by (16), which is lower bounded by the expression in (15) when $\epsilon \ge 0$.

Now suppose that $\epsilon < 0$. Choosing $\delta = L$ and $\lambda = L$, after multiplying the numerator and denominator of (14) by $1/(\rho L)$, we see that $T$ has eigenvalue

$$1 - \frac{2\alpha}{(1 + \kappa^{0.5 - \epsilon})(\kappa^{-0.5 + \epsilon} + 1)} \ge 1 - \frac{2\alpha}{1 + \kappa^{0.5 - \epsilon}}. \tag{17}$$

When initialized with $z$ as the eigenvector corresponding to this eigenvalue, Algorithm 2 will converge linearly with rate given exactly by the left hand side of (17), which is lower bounded by the expression in (15) when $\epsilon < 0$. □

Figure 3 compares the lower bounds given by (16) with the upper bounds given by Theorem 6 for $\alpha = 1.5$ and for several choices of $\rho = (\hat{m}\hat{L})^{1/2}\kappa^\epsilon$ satisfying $\epsilon \ge 0$. The upper and lower bounds agree visually on the range of choices of $\epsilon$ depicted, demonstrating the practical tightness of the upper bounds given by Theorem 6 for a large range of choices of parameter values.

Figure 3. For $\alpha = 1.5$ and for several choices of $\epsilon$ in $\rho_0 = \kappa^\epsilon$, we plot $-1/\log \tau$ as a function of $\kappa$, both for the lower bound on $\tau$ given by (16) and the upper bound on $\tau$ given by Theorem 6. For each choice of $\epsilon$ in $\{0.5, 0.25, 0\}$, the lower and upper bounds agree visually. This agreement demonstrates the practical tightness of the upper bounds given by Theorem 6 for a large range of choices of parameter values.

7. Related Work

Several recent papers have studied the linear convergence of Algorithm 1 but do not extend to Algorithm 2. Deng & Yin (2012) prove a linear rate of convergence for ADMM in the strongly convex case. Iutzeler et al. (2014) prove the linear convergence of a specialization of ADMM to a class of distributed optimization problems under a local strong-convexity condition. Hong & Luo (2012) prove the linear convergence of a generalization of ADMM to a multiterm objective in the setting where each term can be decomposed as a strictly convex function and a polyhedral function. In particular, this result does not require strong convexity.

More generally, there are a number of results for operator splitting methods in the literature. Lions & Mercier (1979) and Eckstein & Ferris (1998) analyze the convergence of several operator splitting schemes. More recently, Patrinos et al. (2014a;b) prove the equivalence of forward-backward splitting and Douglas-Rachford splitting with a scaled version of the gradient method applied to unconstrained nonconvex surrogate functions (called the forward-backward envelope and the Douglas-Rachford envelope respectively). Goldstein et al. (2012) propose an accelerated version of ADMM in the spirit of Nesterov, and prove an $O(1/k^2)$ convergence rate in the case where $f$ and $g$ are both strongly convex and $g$ is quadratic.

The theory of over-relaxed ADMM is more limited. Eckstein & Bertsekas (1992) prove the convergence of over-relaxed ADMM but do not give a rate. More recently, Davis & Yin (2014a;b) analyze the convergence rates of ADMM in a variety of settings. Giselsson & Boyd (2014) prove the linear convergence of Douglas-Rachford splitting in the strongly-convex setting. They use the fact that ADMM is Douglas-Rachford splitting applied to the dual problem (Eckstein & Bertsekas, 1992) to derive a linear convergence rate for over-relaxed ADMM with a specific choice of step size $\rho$. Eckstein (1994) gives convergence results for several specializations of ADMM, and finds that over-relaxation with $\alpha = 1.5$ empirically speeds up convergence. Ghadimi et al. (2015) give some guidance on tuning over-relaxed ADMM in the quadratic case.

Unlike prior work, our framework requires no assumptions on the parameter choices in Algorithm 2. For example, Theorem 6 certifies the linear convergence of Algorithm 2 even for values $\alpha > 2$. In our framework, certifying a convergence rate for an arbitrary choice of parameters amounts to checking the feasibility of a $4 \times 4$ semidefinite program, which is essentially instantaneous, as opposed to formulating a proof.

8. Selecting Algorithm Parameters 

In this section, we show how to use the results of Section 4 to select the parameters $\alpha$ and $\rho$ in Algorithm 2, and we show the effect on a numerical example.

Recall that given a choice of parameters $\alpha$ and $\rho$ and given the condition number $\kappa$, Theorem 6 gives an upper bound on the convergence rate of Algorithm 2. Therefore, one approach to parameter selection is to do a grid search over the space of parameters for the choice that minimizes the upper bound provided by Theorem 6. We demonstrate this approach numerically for a distributed Lasso problem, but first we demonstrate that the usual range of $(0, 2)$ for the over-relaxation parameter $\alpha$ is too restrictive: values of $\alpha$ outside this range can also lead to linear convergence. In Figure 4, we plot the largest value of $\alpha$ found through binary search such that (11) is satisfied for some $\tau < 1$ as a function of $\kappa$. Proof techniques in prior work do not extend as easily to values of $\alpha > 2$. In our framework, we simply change some constants in a small semidefinite program.
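The grid search described above can be sketched as follows. The callable `rate_bound` is a hypothetical stand-in for the numerical bound of Theorem 6 (for instance, a binary search over the rate wrapped around an SDP feasibility check); it is assumed to return `None` when no rate below one is certified. The function name `grid_search` is likewise illustrative.

```python
import numpy as np


def grid_search(rate_bound, alphas, rhos):
    """Return (alpha, rho, bound) minimizing the certified rate over a grid.

    `rate_bound(alpha, rho)` is an assumed callable returning an upper bound
    on the convergence rate, or None when the LMI is infeasible for every
    rate below one. Returns (None, None, inf) if nothing is certified.
    """
    best_alpha, best_rho, best_bound = None, None, np.inf
    for alpha in alphas:
        for rho in rhos:
            bound = rate_bound(alpha, rho)
            if bound is not None and bound < best_bound:
                best_alpha, best_rho, best_bound = alpha, rho, bound
    return best_alpha, best_rho, best_bound
```

A grid of evenly spaced values of $\alpha$ and geometrically spaced values of $\rho$ can be built with `np.linspace` and `np.geomspace`, respectively.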



Figure 4. As a function of $\kappa$, we plot the largest value of $\alpha$ such that (11) is satisfied for some $\tau < 1$. In this figure, we set $\epsilon = 0$ in $\rho_0 = \kappa^\epsilon$.

8.1. Distributed Lasso 

Following Deng & Yin (2012), we give a numerical demonstration with a distributed Lasso problem of the form

$$\begin{array}{ll} \text{minimize} & \displaystyle\sum_{i=1}^{N} \tfrac{1}{2}\|A_i x_i - b_i\|^2 + \mu\|z\|_1 \\ \text{subject to} & x_i - z = 0 \ \text{ for all } i = 1, \ldots, N. \end{array}$$

Each $A_i$ is a tall matrix with full column rank, and so the first term in the objective will be strongly convex and its gradient will be Lipschitz continuous. As in Deng & Yin (2012), we choose $N = 5$ and $\mu = 0.1$. Each $A_i$ is generated by populating a $600 \times 500$ matrix with independent standard normal entries and normalizing the columns. We generate each $b_i$ via $b_i = A_i x^0 + \varepsilon_i$, where $x^0$ is a sparse 500-dimensional vector with 250 independent standard normal entries, and $\varepsilon_i \sim \mathcal{N}(0, 10^{-3} I)$.
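The data generation just described can be sketched as follows. Several details are assumptions made for illustration rather than statements from the paper: columns are normalized to unit Euclidean norm, the support of the sparse vector is drawn uniformly at random, the same sparse vector is shared by all blocks, and the noise variance is exposed as the parameter `noise_var` (its default is an assumed value). The function name is hypothetical.

```python
import numpy as np


def make_distributed_lasso(N=5, rows=600, cols=500, nnz=250,
                           noise_var=1e-3, seed=0):
    """Generate (A_1..A_N, b_1..b_N, x0) for a distributed Lasso instance."""
    rng = np.random.default_rng(seed)
    # Sparse ground truth: nnz standard normal entries on a random support.
    x0 = np.zeros(cols)
    support = rng.choice(cols, size=nnz, replace=False)
    x0[support] = rng.standard_normal(nnz)
    As, bs = [], []
    for _ in range(N):
        A = rng.standard_normal((rows, cols))
        A /= np.linalg.norm(A, axis=0, keepdims=True)  # unit-norm columns (assumed)
        noise = np.sqrt(noise_var) * rng.standard_normal(rows)
        As.append(A)
        bs.append(A @ x0 + noise)
    return As, bs, x0
```

Because each block is tall (more rows than columns) with Gaussian entries, it has full column rank with probability one, which is what makes the smooth term strongly convex.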

In Figure 5, we compute the upper bounds on the convergence rate given by Theorem 6 for a grid of values of $\alpha$ and $\rho$. Each line corresponds to a fixed choice of $\alpha$, and we plot only a subset of the values of $\alpha$ to keep the plot manageable. We omit points corresponding to parameter values for which the linear matrix inequality in (11) was not feasible for any value of $\tau < 1$.

A General Analysis of the Convergence of ADMM

Figure 5. We compute the upper bounds on the convergence rate given by Theorem 6 for eighty-five values of $\alpha$ evenly spaced between 0.1 and 2.2 and fifty values of $\rho$ geometrically spaced between 0.1 and 10. Each line corresponds to a fixed choice of $\alpha$, and we show only a subset of the values of $\alpha$ to keep the plot manageable. We omit points corresponding to parameter values for which (11) is not feasible for any value of $\tau < 1$. This analysis suggests choosing $\alpha = 2.0$ and $\rho = 1.7$.

In Figure 6, we run Algorithm 2 for the same values of $\alpha$ and $\rho$. We then plot the number of iterations needed for $z_k$ to reach within $10^{-6}$ of a precomputed reference solution. We plot lines corresponding to only a subset of the values of $\alpha$ to keep the plot manageable, and we omit points corresponding to parameter values for which Algorithm 2 exceeded 1000 iterations. For the most part, the performance of Algorithm 2 as a function of $\rho$ closely tracked the performance predicted by the upper bounds in Figure 5. Notably, smaller values of $\alpha$ seem more robust to poor choices of $\rho$. The parameters suggested by our analysis perform close to the best of any parameter choices.

9. Discussion 

We showed that a framework based on semidefinite programming can be used to prove convergence rates for the alternating direction method of multipliers and allows a unified treatment of the algorithm's many variants, which arise through the introduction of additional parameters. We showed how to use this framework for establishing convergence rates, as in Theorem 6 and Theorem 7, and how to use this framework for parameter selection in practice, as in Section 8. The potential uses are numerous. This framework makes it straightforward to propose new algorithmic variants, for example, by introducing new parameters into Algorithm 2 and using Theorem 6 to see if various settings of these new parameters give rise to improved guarantees.

In the case that Assumption 3 does not hold, the most likely cause is that we lack the strong convexity of $f$. One approach to handling this is to run Algorithm 2 on the modified function $f(x) + \frac{\delta}{2}\|x\|^2$. By completing the square in the $x$ update, we see that this amounts to an extremely minor algorithmic modification (it only affects the $x$ update).

It should be clear that other operator splitting methods such as Douglas-Rachford splitting and forward-backward splitting can be cast in this framework and analyzed using the tools presented here.

Figure 6. We run Algorithm 2 for eighty-five values of $\alpha$ evenly spaced between 0.1 and 2.2 and fifty values of $\rho$ geometrically spaced between 0.1 and 10. We plot the number of iterations required for $z_k$ to reach within $10^{-6}$ of a precomputed reference solution. We show only a subset of the values of $\alpha$ to keep the plot manageable. We omit points corresponding to parameter values for which Algorithm 2 exceeded 1000 iterations.